AITopics | Aegean Sea

RePro: Training Language Models to Faithfully Recycle the Web for Pretraining

arXiv.org Artificial Intelligence Oct-15-2025

High-quality pretraining data is the fossil fuel of large language models (LLMs), yet its reserves are running low for frontier models. In this paper, we introduce RePro, a novel web recycling method that trains a relatively small LM with reinforcement learning to generate effective and faithful rephrasings of pretraining data. Specifically, we design one quality reward and three faithfulness rewards, optimizing the LM rephraser to convert organic data into high-quality rephrasings while maintaining its core semantics and structure. In our experiment, we train a 4B rephraser to recycle 72B tokens sampled from DCLM-RefinedWeb. Pretraining results on 400M and 1.4B models demonstrate that RePro delivers 4.7%-14.0% relative accuracy gains over the organic-only baseline on 22 downstream tasks. RePro also outperforms ReWire, the state-of-the-art web recycling method that prompts a 70B rephraser, as well as the organic baseline with a 4x larger data pool. Experiments with different amounts of recycled data highlight that RePro improves organic data efficiency by 2-3x. Individual and distributional analyses validate that RePro preserves more critical information and faithfully reflects the characteristics of organic data compared to prompting-based methods. Together, these results show that RePro provides an efficient and controllable path to effectively harness the fossil fuel of LLM pretraining. We open-source our code, rephraser, and recycled data at https://github.com/cxcscmu/RePro.

large language model, machine learning, natural language, (20 more...)

arXiv:2510.10681

Country:

Asia > Middle East (0.28)
North America > United States (0.28)
Atlantic Ocean > Mediterranean Sea > Aegean Sea > Sea of Marmara (0.14)

Genre: Research Report > New Finding (1.00)

Industry: Government > Military (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Natural Language > Chatbot (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.93)

MedFormer: a data-driven model for forecasting the Mediterranean Sea

Epicoco, Italo, Donno, Davide, Accarino, Gabriele, Norberti, Simone, Grandi, Alessandro, Giurato, Michele, McAdam, Ronan, Elia, Donatello, Clementi, Emanuela, Nassisi, Paola, Scoccimarro, Enrico, Coppini, Giovanni, Gualdi, Silvio, Aloisio, Giovanni, Masina, Simona, Boccaletti, Giulio, Navarra, Antonio

arXiv.org Artificial Intelligence Sep-3-2025

Accurate ocean forecasting is essential for supporting a wide range of marine applications. Recent advances in artificial intelligence have highlighted the potential of data-driven models to outperform traditional numerical approaches, particularly in atmospheric weather forecasting. However, extending these methods to ocean systems remains challenging due to their inherently slower dynamics and complex boundary conditions. In this work, we present MedFormer, a fully data-driven deep learning model specifically designed for medium-range ocean forecasting in the Mediterranean Sea. MedFormer is based on a U-Net architecture augmented with 3D attention mechanisms and operates at a high horizontal resolution of 1/24°. The model is trained on 20 years of daily ocean reanalysis data and fine-tuned with high-resolution operational analyses. It generates 9-day forecasts using an autoregressive strategy. The model leverages both historical ocean states and atmospheric forcings, making it well-suited for operational use. We benchmark MedFormer against the state-of-the-art Mediterranean Forecasting System (MedFS), developed at the Euro-Mediterranean Center on Climate Change (CMCC), using both analysis data and independent observations. The forecast skills, evaluated with the Root Mean Squared Difference and the Anomaly Correlation Coefficient, indicate that MedFormer consistently outperforms MedFS across key 3D ocean variables. These findings underscore the potential of data-driven approaches like MedFormer to complement, or even surpass, traditional numerical ocean forecasting systems in both accuracy and computational efficiency.
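
For readers unfamiliar with the two skill metrics named in this abstract, the following is a minimal sketch and not the authors' evaluation code. It assumes flat lists of values at matching grid points, the uncentered form of the Anomaly Correlation Coefficient, and a hypothetical `climatology` reference; the function names `rmsd` and `acc` are illustrative only.

```python
import math


def rmsd(forecast, reference):
    """Root Mean Squared Difference between two equal-length sequences."""
    if len(forecast) != len(reference):
        raise ValueError("forecast and reference must have the same length")
    squared = [(f - r) ** 2 for f, r in zip(forecast, reference)]
    return math.sqrt(sum(squared) / len(squared))


def acc(forecast, reference, climatology):
    """Uncentered Anomaly Correlation Coefficient.

    Anomalies are taken relative to a climatology value at each point
    (an assumed input here), then correlated without removing their mean.
    """
    f_anom = [f - c for f, c in zip(forecast, climatology)]
    r_anom = [r - c for r, c in zip(reference, climatology)]
    numerator = sum(a * b for a, b in zip(f_anom, r_anom))
    denominator = math.sqrt(
        sum(a * a for a in f_anom) * sum(b * b for b in r_anom)
    )
    if denominator == 0:
        raise ValueError("anomalies are all zero; ACC is undefined")
    return numerator / denominator
```

A lower RMSD and an ACC closer to 1 both indicate a forecast that tracks the reference more closely.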

artificial intelligence, machine learning, medfs, (18 more...)

arXiv:2509.00015

Country: Atlantic Ocean > Mediterranean Sea > Aegean Sea (0.14)

Genre: Research Report > New Finding (1.00)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.69)

DualToken: Towards Unifying Visual Understanding and Generation with Dual Visual Vocabularies

Song, Wei, Wang, Yuran, Song, Zijia, Li, Yadong, Sun, Haoze, Chen, Weipeng, Zhou, Zenan, Xu, Jianhua, Wang, Jiaqi, Yu, Kaicheng

arXiv.org Artificial Intelligence Mar-19-2025

The differing representation spaces required for visual understanding and generation pose a challenge in unifying them within the autoregressive paradigm of large language models. A vision tokenizer trained for reconstruction excels at capturing low-level perceptual details, making it well-suited for visual generation but lacking high-level semantic representations for understanding tasks. Conversely, a vision encoder trained via contrastive learning aligns well with language but struggles to decode back into the pixel space for generation tasks. To bridge this gap, we propose DualToken, a method that unifies representations for both understanding and generation within a single tokenizer. However, directly integrating reconstruction and semantic objectives in a single tokenizer creates conflicts, leading to degraded performance in both reconstruction quality and semantic performance. Instead of forcing a single codebook to handle both semantic and perceptual information, DualToken disentangles them by introducing separate codebooks for high and low-level features, effectively transforming their inherent conflict into a synergistic relationship. As a result, DualToken achieves state-of-the-art performance in both reconstruction and semantic tasks while demonstrating remarkable effectiveness in downstream MLLM understanding and generation tasks. Notably, we also show that DualToken, as a unified tokenizer, surpasses the naive combination of two distinct types of vision encoders, providing superior performance within a unified MLLM.

large language model, machine learning, natural language, (20 more...)

arXiv:2503.14324

Country:

Asia > China > Shanghai > Shanghai (0.04)
Europe > Iceland (0.04)
Europe > Greece (0.04)
(2 more...)

Genre: Research Report (0.65)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.91)
Information Technology > Artificial Intelligence > Natural Language > Text Processing (0.67)

FLP-XR: Future Location Prediction on Extreme Scale Maritime Data in Real-time

Theodoropoulos, George S., Patakis, Andreas, Tritsarolis, Andreas, Theodoridis, Yannis

arXiv.org Artificial Intelligence Mar-19-2025

Movements of maritime vessels are inherently complex and challenging to model due to the dynamic and often unpredictable nature of maritime operations. Even within structured maritime environments, such as shipping lanes and port approaches, where vessels adhere to navigational rules and predefined sea routes, uncovering underlying patterns is far from trivial. The necessity for accurate modeling of the mobility of maritime vessels arises from the numerous applications it serves, including risk assessment for collision avoidance, optimization of shipping routes, and efficient port management. This paper introduces FLP-XR, a model that leverages maritime mobility data to construct a robust framework that offers precise predictions while ensuring extremely fast training and inference capabilities. We demonstrate the efficiency of our approach through an extensive experimental study using three real-world AIS datasets. According to the experimental results, FLP-XR outperforms the current state of the art in many cases while running 2-3 orders of magnitude faster in training and inference.

data mining, machine learning, real time system, (20 more...)

arXiv:2503.13491

Country:

Europe > Greece (0.05)
North America > Trinidad and Tobago > Trinidad > Arima > Arima (0.04)
Europe > Netherlands > South Holland > Rotterdam (0.04)
(4 more...)

Genre: Research Report > New Finding (0.67)

Industry:

Information Technology (1.00)
Transportation > Marine (0.86)
Transportation > Freight & Logistics Services > Shipping (0.54)

Technology:

Information Technology > Data Science > Data Mining (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Architecture > Real Time Systems (0.87)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.30)

The Illusion of Rights based AI Regulation

Mei, Yiyang, Sag, Matthew

arXiv.org Artificial Intelligence Feb-26-2025

Whether and how to regulate AI is one of the defining questions of our times - a question that is being debated locally, nationally, and internationally. We argue that much of this debate is proceeding on a false premise. Specifically, our article challenges the prevailing academic consensus that the European Union's AI regulatory framework is fundamentally rights-driven and the correlative presumption that other rights-regarding nations should therefore follow Europe's lead in AI regulation. Rather than taking rights language in EU rules and regulations at face value, we show how EU AI regulation is the logical outgrowth of a particular cultural, political, and historical context. We show that although instruments like the General Data Protection Regulation (GDPR) and the AI Act invoke the language of fundamental rights, these rights are instrumentalized - used as rhetorical cover for governance tools that address systemic risks and maintain institutional stability. As such, we reject claims that the EU's regulatory framework and the substance of its rules should be adopted as universal imperatives and transplanted to other liberal democracies. To add weight to our argument from historical context, we conduct a comparative analysis of AI regulation in five contested domains: data privacy, cybersecurity, healthcare, labor, and misinformation. This EU-US comparison shows that the EU's regulatory architecture is not meaningfully rights-based. Our article's key intervention in AI policy debates is not to suggest that the current American regulatory model is necessarily preferable but that the presumed legitimacy of the EU's AI regulatory approach must be abandoned.

illusion, regulation, rights, (14 more...)

arXiv:2503.05784

Country:

North America > United States > New York (0.04)
North America > United States > Illinois (0.04)
Europe > Russia (0.04)
(29 more...)

Genre:

Overview (0.92)
Research Report (0.82)

Industry:

Law > Statutes (1.00)
Information Technology > Security & Privacy (1.00)
Health & Medicine > Health Care Providers & Services > Reimbursement (1.00)
(3 more...)

Technology:

Information Technology > Security & Privacy (1.00)
Information Technology > Artificial Intelligence > Issues > Social & Ethical Issues (1.00)

Regional Ocean Forecasting with Hierarchical Graph Neural Networks

Holmberg, Daniel, Clementi, Emanuela, Roos, Teemu

arXiv.org Artificial Intelligence Nov-20-2024

Accurate ocean forecasting systems are vital for understanding marine dynamics, which play a crucial role in environmental management and climate adaptation strategies. Traditional numerical solvers, while effective, are computationally expensive and time-consuming. Recent advancements in machine learning have revolutionized weather forecasting, offering fast and energy-efficient alternatives. Building on these advancements, we introduce SeaCast, a neural network designed for high-resolution, medium-range ocean forecasting. SeaCast employs a graph-based framework to effectively handle the complex geometry of ocean grids and integrates external forcing data tailored to the regional ocean context. Our approach is validated through experiments at a high spatial resolution using the operational numerical model of the Mediterranean Sea provided by the Copernicus Marine Service, along with both numerical and data-driven atmospheric forcings.

forcing, forecast, seacast, (15 more...)

arXiv:2410.11807

Country:

Europe > Finland > Uusimaa > Helsinki (0.04)
Europe > Gibraltar (0.04)
Atlantic Ocean > Mediterranean Sea > Aegean Sea > Sea of Marmara > Dardanelles (0.04)
(13 more...)

Genre: Research Report (1.00)

Industry: Transportation > Marine (0.34)

Technology:

Information Technology > Modeling & Simulation (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)

Few-Shot Joint Multimodal Entity-Relation Extraction via Knowledge-Enhanced Cross-modal Prompt Model

Yuan, Li, Cai, Yi, Huang, Junsheng

arXiv.org Artificial Intelligence Oct-18-2024

Joint Multimodal Entity-Relation Extraction (JMERE) is a challenging task that aims to extract entities and their relations from text-image pairs in social media posts. Existing methods for JMERE require large amounts of labeled data. However, gathering and annotating fine-grained multimodal data for JMERE poses significant challenges. First, we construct diverse and comprehensive multimodal few-shot datasets fitted to the original data distribution. To address the insufficient information in the few-shot setting, we introduce the Knowledge-Enhanced Cross-modal Prompt Model (KECPM) for JMERE. This method can effectively address the problem of insufficient information in the few-shot setting by guiding a large language model to generate supplementary background knowledge. Our proposed method comprises two stages: (1) a knowledge ingestion stage that dynamically formulates prompts based on semantic similarity to guide ChatGPT in generating relevant knowledge and employs self-reflection to refine the knowledge; (2) a knowledge-enhanced language model stage that merges the auxiliary knowledge with the original input and utilizes a transformer-based model to align with JMERE's required output format. We extensively evaluate our approach on a few-shot dataset derived from the JMERE dataset, demonstrating its superiority over strong baselines in terms of both micro and macro F1 scores. Additionally, we present qualitative analyses and case studies to elucidate the effectiveness of our model.
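
The difference between the micro and macro F1 scores mentioned in this abstract can be illustrated with a minimal sketch. This is not the paper's evaluation protocol, which the abstract does not specify; the per-class counts, the class names, and the helper names `f1` and `micro_macro_f1` are hypothetical.

```python
def f1(tp, fp, fn):
    """F1 from true positives, false positives, and false negatives."""
    denominator = 2 * tp + fp + fn
    return 2 * tp / denominator if denominator else 0.0


def micro_macro_f1(counts):
    """Return (micro_f1, macro_f1) from per-class (tp, fp, fn) counts.

    Micro F1 pools the counts over all classes before scoring, so frequent
    classes dominate. Macro F1 scores each class separately and averages,
    so every class carries equal weight.
    """
    total_tp = sum(tp for tp, _, _ in counts.values())
    total_fp = sum(fp for _, fp, _ in counts.values())
    total_fn = sum(fn for _, _, fn in counts.values())
    micro = f1(total_tp, total_fp, total_fn)
    macro = sum(f1(*c) for c in counts.values()) / len(counts)
    return micro, macro


# Hypothetical counts: one frequent, well-predicted class and one rare, hard class.
example_counts = {"frequent_relation": (8, 2, 2), "rare_relation": (1, 1, 3)}
micro_score, macro_score = micro_macro_f1(example_counts)
```

With these made-up counts the micro score is higher than the macro score, because the rare class's weak performance is diluted when counts are pooled but counts fully when classes are averaged.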

large language model, machine learning, natural language, (18 more...)

arXiv:2410.14225

Country:

Oceania > Australia > Victoria > Melbourne (0.15)
Europe > Greece (0.05)
Europe > Spain > Galicia > Madrid (0.04)
(6 more...)

Genre: Research Report (0.82)

Industry:

Government > Regional Government > North America Government > United States Government (0.46)
Leisure & Entertainment > Sports > Soccer (0.46)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Text Processing (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Dr.Academy: A Benchmark for Evaluating Questioning Capability in Education for Large Language Models

Chen, Yuyan, Wu, Chenwei, Yan, Songzhou, Liu, Panjun, Zhou, Haoyu, Xiao, Yanghua

arXiv.org Artificial Intelligence Aug-20-2024

Teachers are important in imparting knowledge and guiding learners, and the role of large language models (LLMs) as potential educators is emerging as an important area of study. Recognizing LLMs' capability to generate educational content can lead to advances in automated and personalized learning. While LLMs have been tested for their comprehension and problem-solving skills, their capability in teaching remains largely unexplored. In teaching, questioning is a key skill that guides students to analyze, evaluate, and synthesize core concepts and principles. Therefore, our research introduces a benchmark to evaluate the questioning capability of LLMs as teachers in education by assessing the educational questions they generate, utilizing Anderson and Krathwohl's taxonomy across general, monodisciplinary, and interdisciplinary domains. We shift the focus from LLMs as learners to LLMs as educators, assessing their teaching capability by guiding them to generate questions. We apply four metrics, including relevance, coverage, representativeness, and consistency, to evaluate the educational quality of LLMs' outputs. Our results indicate that GPT-4 demonstrates significant potential in teaching general, humanities, and science courses; Claude2 appears more apt as an interdisciplinary teacher. Furthermore, the automatic scores align with human perspectives.

huxley, opération, python 3, (13 more...)

arXiv:2408.10947

Country:

Africa > Middle East > Egypt (0.04)
Europe > Middle East > Republic of Türkiye > Istanbul Province > Istanbul (0.04)
Asia > Middle East > Republic of Türkiye > Istanbul Province > Istanbul (0.04)
(23 more...)

Genre:

Research Report > New Finding (0.47)
Personal > Interview (0.46)
Instructional Material > Course Syllabus & Notes (0.34)

Industry:

Materials > Chemicals > Industrial Gases (1.00)
Education > Assessment & Standards (0.66)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.92)

Development of an AI Anti-Bullying System Using Large Language Model Key Topic Detection

Tassava, Matthew, Kolodjski, Cameron, Milbrath, Jordan, Bishop, Adorah, Flanders, Nathan, Fetsch, Robbie, Hanson, Danielle, Straub, Jeremy

arXiv.org Artificial Intelligence Aug-19-2024

Cyberbullying has become a pronounced problem due to the increasing ubiquity of online platforms that provide a means to conduct it. A significant amount of this cyberbullying is conducted by and targets teenagers. It is difficult for teenage students to shut themselves off from the digital world in which the cyberbullying is taking place. Given how entrenched the use of digital apps is among today's youth, and the pronounced consequences of it - including victim self-harm, in some cases - cyberbullying is at least as much of a threat as physical bullying. Additionally, because of the obfuscation caused by the online environment, authorities (such as parents, teachers and law enforcement) may have difficulty determining what has occurred and who the participants are.

json object, only respond, system prompt, (14 more...)

arXiv:2408.10417

Country:

Africa (0.04)
Oceania > New Zealand (0.04)
Europe > Italy > Tuscany (0.04)
(9 more...)

Genre: Research Report (0.63)

Industry:

Law Enforcement & Public Safety > Crime Prevention & Enforcement (1.00)
Information Technology > Security & Privacy (1.00)
Health & Medicine (1.00)
Education > Educational Setting > K-12 Education (0.45)

Technology:

Information Technology > Communications > Social Media (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.83)

IBB Traffic Graph Data: Benchmarking and Road Traffic Prediction Model

Olug, Eren, Kaya, Kiymet, Tugay, Resul, Oguducu, Sule Gunduz

arXiv.org Artificial Intelligence Aug-2-2024

Road traffic congestion prediction is a crucial component of intelligent transportation systems, since it enables proactive traffic management, enhances suburban experience, reduces environmental impact, and improves overall safety and efficiency. Although there are several public datasets, especially for metropolitan areas, these datasets may not be applicable to practical scenarios due to the insufficient scale of the data (i.e., the number of sensors and road links) and several external factors, such as differing characteristics of the target area (e.g., urban roads versus highways) and the data collection location. To address this, this paper introduces a novel IBB Traffic graph dataset as an alternative benchmark dataset to mitigate these limitations and enrich the literature with new geographical characteristics. The IBB Traffic graph dataset covers sensor data collected at 2451 distinct locations. Moreover, we propose a novel Road Traffic Prediction Model that strengthens temporal links through feature engineering, node embedding with GLEE to represent inter-related relationships within the traffic network, and traffic prediction with ExtraTrees. The results indicate that the proposed model consistently outperforms the baseline models, demonstrating an average accuracy improvement of 4%.

dataset, node, prediction, (12 more...)

arXiv:2408.01016

Country:

Europe > Middle East > Republic of Türkiye > Istanbul Province > Istanbul (0.06)
Asia > Middle East > Republic of Türkiye > Istanbul Province > Istanbul (0.06)
North America > Trinidad and Tobago > Trinidad > Arima > Arima (0.04)
(2 more...)

Genre: Research Report (0.83)

Industry: Transportation > Infrastructure & Services (0.35)

Technology:

Information Technology > Data Science > Data Mining (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
